Home ownership is synonymous with security and wealth.
The global housing market attracts increasing fascination and criticism as prices rise.
With housing supply low and demand high, these rising prices show no sign of slowing, and they are of increasing concern to our generation.
House prices in Sydney averaged near $27,500 in 1970, equivalent to about $250,000 in today's dollars. By comparison, the current median house price in Sydney is $1.1 million.
Why is predicting house prices important?
Knowledge of the factors influencing price is a valuable financial literacy skill for young adults in this environment.
It informs decisions about buying or investing in property and is crucial for gauging the fair market value of a property.
It’s an essential skill for university students facing life ahead.
Our dataset is based on information collected on houses in Saratoga County, New York, USA in 2006, containing 1,734 observations on 17 variables. The Test variable was ignored as its meaning was unknown.
| Variable | Description | Variable | Description |
| --- | --- | --- | --- |
| Price | price of the house | Lot.Size | size of the house's lot in acres |
| Age | age of the house in years | Land.Value | value of land ($USD) |
| Living.Area | living area in square feet | Pct.College | percentage of the neighbourhood that graduated college |
| Bedrooms | number of bedrooms | Fireplaces | number of fireplaces |
| Bathrooms | number of bathrooms | Rooms | number of rooms |
| Heating.Type | type of heating system | Fuel.Type | type of fuel used for heating |
| Sewer.Type | type of sewer system | Waterfront | whether the property includes waterfront |
| New.Construction | whether the property is a new construction | Central.Air | whether the house has central air |
In our problem, Price was the dependent variable to be predicted, and all other variables were considered independent variables influencing Price.
Data Processing before Analysis
Checked that all columns had valid entries
Checked that no columns had missing entries
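These pre-analysis checks can be sketched in a few lines; a minimal illustration in Python (the in-memory sample rows are invented stand-ins for the real data, and column names follow the glossary above):

```python
# Sketch of the pre-analysis checks: every column should contain only
# valid, non-missing entries. The rows below are invented sample data.
rows = [
    {"Price": "132500", "Lot.Size": "0.09", "Bedrooms": "2"},
    {"Price": "181115", "Lot.Size": "0.92", "Bedrooms": "3"},
]

def find_invalid(rows, numeric_cols):
    """Return (column, row index) pairs with missing or non-numeric entries."""
    problems = []
    for i, row in enumerate(rows):
        for col in numeric_cols:
            value = row.get(col, "")
            if value == "" or value is None:
                problems.append((col, i))   # missing entry
                continue
            try:
                float(value)                # numeric entry is valid
            except ValueError:
                problems.append((col, i))   # invalid (non-numeric) entry
    return problems

print(find_invalid(rows, ["Price", "Lot.Size", "Bedrooms"]))  # [] -> clean
```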
To quantify the relationship between Price and the properties influencing it, we conducted linear regression analysis and compared the models using the following metrics:
R squared - the proportion of variance in the dependent variable explained by the independent variables (in-sample)
Adjusted R squared - R squared penalised for the number of independent variables, to address overfitting (in-sample)
RMSE - root mean squared error: the typical magnitude of the residuals between predicted and observed values, penalising large errors more heavily (out-of-sample)
MAE - mean absolute error: the average absolute magnitude of the residuals between predicted and observed values (out-of-sample)
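All four metrics are straightforward to compute from predictions; a minimal sketch using NumPy (the toy arrays are illustrative, not drawn from the Saratoga data):

```python
import numpy as np

def r_squared(y, y_hat):
    """Proportion of variance in y explained by the predictions."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, p):
    """R squared penalised for the number of independent variables p."""
    n = len(y)
    return 1 - (1 - r_squared(y, y_hat)) * (n - 1) / (n - p - 1)

def rmse(y, y_hat):
    """Root mean squared error: penalises large residuals more heavily."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    """Mean absolute error: average absolute residual."""
    return float(np.mean(np.abs(y - y_hat)))

# Toy illustration only
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
print(r_squared(y, y_hat), rmse(y, y_hat), mae(y, y_hat))
```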
| Model | \(R^2\) | Adjusted \(R^2\) | RMSE | MAE |
| --- | --- | --- | --- | --- |
| Full Model | 0.6553 | 0.6509 | 58014.45 | 415380.38 |
| Log Transform | 0.5941 | 0.5889 | 0.2915 | 0.2077 |
| Stepwise Forward | 0.5919 | 0.5889 | 0.2926 | 0.2077 |
| Stepwise Backward | 0.5935 | 0.5897 | 0.2927 | 0.2080 |
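A log-linear fit like the ones compared above can be sketched with ordinary least squares; a minimal illustration on synthetic data (not the Saratoga data, and with only one predictor for clarity):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
living_area = rng.uniform(800, 4000, size=n)            # synthetic predictor
noise = rng.normal(scale=0.2, size=n)
log_price = 7.0 + 0.5 * np.log(living_area) + noise     # synthetic ground truth

# Design matrix: intercept + log(Living.Area), mirroring the log models above
A = np.column_stack([np.ones(n), np.log(living_area)])
beta, *_ = np.linalg.lstsq(A, log_price, rcond=None)
print(beta)   # should recover roughly [7.0, 0.5]
```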
A prediction visualization for the Full Model was not created because a log transformation is required to satisfy the model assumptions; additionally, the Full Model is on a different scale from the logged versions.
Linearity - comparing the linear relationship between the dependent variable and each independent variable, conformity to linearity is evident in only some plots
Independence - used the Durbin-Watson test to check for autocorrelation; for the Full Model the DW value was 1.6595, close enough to 2 to indicate the independence assumption is met
Homoskedasticity - the residuals spread out into a funnel shape as fitted values increase, violating homoskedasticity
Normality - a large spike in the residuals near the top end of the data and a drop in the tail mean the extremities of the set do not conform to the normality assumption
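The Durbin-Watson statistic used in the independence checks can be computed directly from the ordered residuals; a minimal sketch (values near 2 suggest no first-order autocorrelation, while values toward 0 or 4 suggest positive or negative autocorrelation):

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic on residuals in observation order."""
    e = np.asarray(residuals, dtype=float)
    # Sum of squared successive differences over sum of squared residuals
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

print(durbin_watson([1.0, 1.0, 1.0, 1.0]))    # 0.0 -> strong positive autocorrelation
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 3.0 -> negative autocorrelation
```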
Linearity - conformity to linearity is far more consistent across the independent variables after the log transformation
Independence - DW value of 1.5627 indicates the independence assumption is met
Homoskedasticity - less funnel-shaped: the residuals no longer spread out and display roughly constant variance
Normality - comparatively more normal-looking than the Full Model, though the tails still deviate slightly from the line
As the Stepwise models use the log transformation, their diagnostics were similar to the Log Transform model's; the following is an example from the Stepwise Forward model.
Linearity - conformity to linearity is far more consistent across the independent variables
Independence - DW value of 1.5672; independence assumption met
Homoskedasticity and Normality - observations similar to the Log Transform model
For our final model we chose the Stepwise Forward model. While the Full Model performed slightly better on the in-sample metrics, the Stepwise Forward model did not violate the assumptions to the same degree. Multicollinearity was found in the Stepwise Backward model and the full log model, but not in the Stepwise Forward model.
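The multicollinearity check can be carried out with variance inflation factors (VIFs): each predictor is regressed on the others, and \(VIF_j = 1/(1 - R_j^2)\), with values above roughly 5-10 commonly treated as problematic. A sketch using NumPy (the random design matrices are illustrative only, not the Saratoga predictors):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X
    (rows = observations, columns = predictors, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other predictors
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
        factors.append(float(1.0 / (1.0 - r2)))     # VIF_j = 1 / (1 - R_j^2)
    return factors

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)                   # independent of a -> VIFs near 1
c = 2 * a + rng.normal(size=200) * 0.01    # nearly collinear with a -> large VIFs
print(vif(np.column_stack([a, b])))
print(vif(np.column_stack([a, c])))
```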
Stepwise Forward Model:
\(\log(\text{Price}) = 6.85 + 0.51\log(\text{Living.Area}) + 0.13\log(\text{Land.Value}) + 0.11\,\text{Bathrooms} + 0.53\,\text{Waterfront} + 0.08\,\text{Heat.Type}_{\text{Hot Air}} + 0.06\,\text{Heat.Type}_{\text{Hot Water}} - 0.35\,\text{Heat.Type}_{\text{None}} + 0.04\,\text{Lot.Size} - 0.001\,\text{Age} - 0.11\,\text{New.Construction} - 0.002\,\text{Pct.College} + 0.04\,\text{Central.Air} + 0.01\,\text{Rooms}\)
Intercept = 6.85
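As an illustration, the fitted equation can be applied directly to a house's attributes; a minimal sketch in Python, assuming natural logarithms and treating the yes/no and heating-type variables as 0/1 indicators (the example house's values are invented):

```python
import math

# Coefficients copied from the Stepwise Forward model above.
def predict_price(living_area, land_value, bathrooms, waterfront,
                  heat_hot_air, heat_hot_water, heat_none, lot_size,
                  age, new_construction, pct_college, central_air, rooms):
    log_price = (6.85
                 + 0.51 * math.log(living_area)
                 + 0.13 * math.log(land_value)
                 + 0.11 * bathrooms
                 + 0.53 * waterfront
                 + 0.08 * heat_hot_air
                 + 0.06 * heat_hot_water
                 - 0.35 * heat_none
                 + 0.04 * lot_size
                 - 0.001 * age
                 - 0.11 * new_construction
                 - 0.002 * pct_college
                 + 0.04 * central_air
                 + 0.01 * rooms)
    return math.exp(log_price)    # back-transform from the log scale

# Invented example: 1500 sq ft, $25,000 land value, 2 bathrooms, hot-air heating
price = predict_price(1500, 25000, 2, 0, 1, 0, 0, 0.5, 20, 0, 55, 1, 7)
print(round(price))
```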
The Stepwise Forward model was the better Price prediction model because it chose stable predictors that are more relevant for explaining the variation in Price, as demonstrated by the following sample interaction plots of Rooms vs Lot.Size.
Assumptions not precisely met: the normal Q-Q plot raises concern about outliers and heavy tails, but with a sample size well above 30 the central limit theorem mitigates this
Dependence on AIC: stepwise selection does not consider all possible combinations of predictors, so a potentially optimal model may be missed; which variables are included depends on AIC
Predict price per square foot, which is recognized as a better metric for determining property desirability and quality
Incorporate further neighbourhood and demographic information, and occupant status (renters or owners), for more accurate analysis
Apply our model to an additional local area to determine its relevance beyond Saratoga County